1 My Video

2 Introduction

Fortnite is a popular online battle royale video game developed by Epic Games in which up to 100 players fight each other. The objective of the game is to be the last player or team standing and achieve the epic Victory Royale. To do so, players must eliminate one another using guns, explosives, or other means.

An Epic Victory Royale


2.1 My Interest in the Data

I have been an avid gamer all my life and have played countless video games, but Fortnite might be my favorite so far. I have been playing Fortnite since March 2018 and still enjoy it to this day. The game has so many unique mechanics that each match feels fresh. As such, I thought a statistical analysis of my favorite game, using my own gameplay, would be interesting.

2.2 The Data

To record the data, I played 50 matches in the “Duos” game mode and recorded the amount of damage I dealt to enemies, the number of eliminations I got, and the result of the match (win or lose). The first two statistics are the focus of this analysis. In the “Duos” game mode, you are teamed with one other person, and the two of you fight other teams of two until your team is either eliminated or the last one standing.

fn = read.csv("fn_stats.csv")
head(fn)

2.3 The Variables

The independent variable is the number of eliminations per game. For my sample of 50 games, it ranges from 0 to 15.

fn$Eliminations
##  [1]  6  2  6  2  1  0  8  0  2  9  1  2  6  8  5  1  4  2  2  4  3  0  2
## [24]  1  1  1  2  7  2  2  4  3  0  1  1  0  3  2  1  4  9  1  5  1  2 15
## [47]  5  8  7  3

The dependent variable is the damage dealt to enemies per game. For my sample of 50 games, it ranges from 40 to 2684.

fn$DamageDealt
##  [1] 1410  499 1445  592  341  271 1202  154  336 1792  691  430 1250 1866
## [15] 1130   96  689  383  707  690  628  140  460  341  333  292  648 1550
## [29]  359  409  927  714   40  286  269   78  609  355  248  947 2040  293
## [43] 1418  405  501 2684  510 2040  953  625

2.4 The Problem to be Solved

I want to find the statistical relationship between damage dealt to enemies and the number of eliminations per match.

2.5 Preliminary Plots

library(s20x)

fn_no_gr = fn[,-3] # Data frame without GameResult column
pairs20x(fn_no_gr)

library(ggplot2)

g = ggplot(fn, aes(x = Eliminations, y = DamageDealt)) 
g = g + geom_point(aes(colour = factor(GameResult))) # refer to columns directly inside aes(), not with fn$
g = g + geom_smooth(method = "loess")
g = g + ggtitle("Damage Dealt vs. Eliminations")
g = g + ylab("Damage Dealt")
g = g + labs(colour = "Game Result")
g

Visually speaking, the data appears to be approximately linear. The following statistical analysis will determine the extent of this linearity.

3 The Theory Behind Simple Linear Regression

Simple linear regression (SLR) is a type of probabilistic model, which means it takes randomness into account. SLR fits a straight line representing the mean value of \(y\) for a given \(x\); any deviation of a data point from that line is accounted for by an error term \(\epsilon_i\). The SLR model is written as \[y_i = \beta_0 + \beta_1x_i + \epsilon_i,\] where \(\beta_0\) and \(\beta_1\) are constant parameters, \(\beta_0 + \beta_1x_i\) is the mean value of \(y\) for a given \(x_i\), and \(\epsilon_i\) is the error term. If we assume that positive and negative deviations from the line of means balance out, so that \(E(\epsilon_i) = 0\), the mean value of \(y\) is \[ \begin{align} E(y_i) &= E(\beta_0 + \beta_1x_i + \epsilon_i)\\ &= \beta_0 + \beta_1x_i + E(\epsilon_i)\\ &= \beta_0 + \beta_1x_i.\\ \end{align} \]

Thus, the mean value of \(y\) for any given \(x\) is a straight line with y-intercept \(\beta_0\) and slope \(\beta_1\).

To find the SLR model for the data, estimators for \(\beta_0\) and \(\beta_1\) must be found. However, these estimators depend on the probability distribution of \(\epsilon\). Consequently, we must operate under the following four assumptions about \(\epsilon\) (Mendenhall):

  1. \(E(\epsilon) = 0\).
  2. \(V(\epsilon)\) is constant.
  3. The probability distribution of \(\epsilon\) is normal.
  4. The \(\epsilon\) from different observations are independent from one another.

4 Estimating the Parameters

4.1 Method of Least Squares
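
The method of least squares chooses the estimates \(\hat{\beta_0}\) and \(\hat{\beta_1}\) that minimize the sum of squared vertical deviations of the data points from the line: \[\min_{\beta_0,\,\beta_1}\sum_{i=1}^n\left(y_i - (\beta_0 + \beta_1x_i)\right)^2\] Setting the partial derivatives with respect to \(\beta_0\) and \(\beta_1\) to zero gives the closed-form solutions \[\hat{\beta_1} = \frac{\sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n(x_i - \bar{x})^2}, \qquad \hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x},\] which is what R's lm() computes below.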

fn.lm = lm(DamageDealt ~ Eliminations, data = fn)
summary(fn.lm)
## 
## Call:
## lm(formula = DamageDealt ~ Eliminations, data = fn)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -537.13  -81.85    5.18   91.70  440.56 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   126.615     40.291   3.142  0.00287 ** 
## Eliminations  184.103      8.925  20.628  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 191.7 on 48 degrees of freedom
## Multiple R-squared:  0.8986, Adjusted R-squared:  0.8965 
## F-statistic: 425.5 on 1 and 48 DF,  p-value: < 2.2e-16

This gives us the estimates \(\hat{\beta_0} =\) 126.615 and \(\hat{\beta_1} =\) 184.103.
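
These estimates come directly from the closed-form least-squares formulas. The sketch below verifies this on simulated data (fn_stats.csv is not reproduced here, so the numbers are illustrative stand-ins, not my actual match statistics):

```r
# Reproduce lm()'s coefficients from the closed-form least-squares formulas.
# x and y are simulated stand-ins for Eliminations and DamageDealt.
set.seed(1)
x <- rpois(50, 3)                        # simulated eliminations per match
y <- 125 + 185 * x + rnorm(50, 0, 190)   # simulated damage dealt per match

b1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0_hat <- mean(y) - b1_hat * mean(x)

fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0_hat, b1_hat))   # TRUE
```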

4.2 Confidence Intervals for the Estimates

ciReg(fn.lm, conf.level = 0.95)
##              95 % C.I.lower    95 % C.I.upper
## (Intercept)        45.60404          207.6254
## Eliminations      166.15851          202.0482

Thus, we are 95% confident that the true, underlying value of \(\beta_0\) lies in the interval (45.604, 207.625), and 95% confident that the true, underlying value of \(\beta_1\) lies in the interval (166.159, 202.048).
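
ciReg() comes from the s20x package; the same intervals can be built by hand as estimate ± t-quantile × standard error, or with base R's confint(). A sketch on simulated data (illustrative numbers, not my match data):

```r
# 95% CI for the slope: estimate +/- t_{0.975, n-2} * SE(estimate).
set.seed(1)
x <- rpois(50, 3)                        # simulated eliminations
y <- 125 + 185 * x + rnorm(50, 0, 190)   # simulated damage dealt
fit <- lm(y ~ x)

se_b1  <- coef(summary(fit))["x", "Std. Error"]
t_crit <- qt(0.975, df = df.residual(fit))
ci_b1  <- coef(fit)["x"] + c(-1, 1) * t_crit * se_b1

all.equal(unname(ci_b1), unname(confint(fit)["x", ]))   # TRUE
```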

4.3 The Least Squares Estimates

The fitted SLR model is: \[\hat{y} = \hat{\beta_0} + \hat{\beta_1}x_i = 126.615 + (184.103)x_i\]

This means that when I get 0 eliminations, my estimated damage done to enemies is 126.615, and after every elimination, I do an estimated additional 184.103 damage.

5 Verifying Assumptions

We must verify the model assumptions in order to demonstrate that a straight line is the best fit for the data gathered.

plot(DamageDealt ~ Eliminations, 
     bg = "Blue", 
     pch = 21, 
     cex = 1.2,
     ylim = c(0, 1.1 * max(DamageDealt)),
     xlim = c(0, 1.1 * max(Eliminations)),
     main = "Fitted Line of Damage Dealt vs. Eliminations",
     data = fn)

abline(fn.lm)

5.1 Plot of Residuals

Residuals are the vertical distances from the individual data points to the fitted line. From these, we can calculate the residual sum of squares (RSS).

plot(DamageDealt ~ Eliminations, 
     bg = "Blue", 
     pch = 21, 
     cex = 1.2,
     ylim = c(0, 1.1 * max(DamageDealt)),
     xlim = c(0, 1.1 * max(Eliminations)),
     main = "Residuals of Damage Dealt vs. Eliminations",
     data = fn)

abline(fn.lm)

yhat = with(fn, predict(fn.lm, data.frame(Eliminations)))
with(fn, {segments(Eliminations, DamageDealt, Eliminations, yhat)})

5.2 Plot of Means

We will also plot the differences between the fitted line and the mean of the damage dealt per game. From these, we can calculate the model sum of squares (MSS).

plot(DamageDealt ~ Eliminations, 
     bg = "Blue", 
     pch = 21, 
     cex = 1.2,
     ylim = c(0, 1.1 * max(DamageDealt)),
     xlim = c(0, 1.1 * max(Eliminations)),
     main = "Mean of Damage Dealt vs. Eliminations",
     data = fn)

abline(fn.lm)
abline(h = mean(fn$DamageDealt))
with(fn, segments(Eliminations, mean(DamageDealt), Eliminations, yhat, col = "Red"))

5.3 Plot of Means with Total Deviation Line Segments

We will also plot the differences between the individual data points and the mean of the damage dealt per game. From these, we can calculate the total sum of squares (TSS).

plot(DamageDealt ~ Eliminations, 
     bg = "Blue", 
     pch = 21, 
     cex = 1.2,
     ylim = c(0, 1.1 * max(DamageDealt)),
     xlim = c(0, 1.1 * max(Eliminations)),
     main = "Total Deviation Line Segments of Damage Dealt vs. Eliminations",
     data = fn)

abline(h = mean(fn$DamageDealt))
with(fn, segments(Eliminations, DamageDealt, Eliminations, mean(DamageDealt), col = "Green"))

5.4 Calculating RSS, MSS, TSS, and R-Squared

\[RSS = \sum_{i = 1}^n(y_i - \hat{y}_i)^2\] \[MSS = \sum_{i = 1}^n(\hat{y}_i - \bar{y})^2\] \[TSS = \sum_{i = 1}^n(y_i - \bar{y})^2\]

RSS = with(fn, sum((DamageDealt - yhat)^2))
RSS
## [1] 1763451
MSS = with(fn, sum((yhat - mean(DamageDealt))^2))
MSS
## [1] 15632615
TSS = with(fn, sum((DamageDealt - mean(DamageDealt))^2))
TSS
## [1] 17396066

\(R^2\) is equal to \(\frac{MSS}{TSS}\), and the closer \(R^2\) is to 1, the better the fit of the trend line.

MSS/TSS
## [1] 0.8986293

Since \(R^2 = 0.8986293\), the trend line fits the data set well.
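
The identity behind this is \(TSS = MSS + RSS\), so \(R^2 = MSS/TSS = 1 - RSS/TSS\), matching the Multiple R-squared reported by summary(). A quick check on simulated data (illustrative numbers only, not my match data):

```r
# Verify the sum-of-squares decomposition and the R^2 identity.
set.seed(1)
x <- rpois(50, 3)                        # simulated eliminations
y <- 125 + 185 * x + rnorm(50, 0, 190)   # simulated damage dealt
fit  <- lm(y ~ x)
yhat <- fitted(fit)

RSS <- sum((y - yhat)^2)          # residual sum of squares
MSS <- sum((yhat - mean(y))^2)    # model sum of squares
TSS <- sum((y - mean(y))^2)       # total sum of squares

all.equal(TSS, MSS + RSS)                     # TRUE
all.equal(MSS / TSS, summary(fit)$r.squared)  # TRUE
```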

5.5 Lowess Smoother Scatter Plot of Damage Dealt vs. Eliminations

trendscatter(DamageDealt ~ Eliminations, f = 0.5, data = fn, main = "Damage Dealt vs. Eliminations")

This plot shows red dotted lines depicting the error band around the trend line. Even with this error taken into account, we still observe a linear trend.

5.6 Plotting Residuals vs. Eliminations

fn.lm = with(fn, lm(DamageDealt ~ Eliminations))
residuals = residuals(fn.lm)

plot(x = fn$Eliminations,
     y = residuals,
     xlab = "Eliminations",
     ylab = "Residuals",
     ylim = c(-1.5 * max(residuals), 1.5 * max(residuals)),
     main = "Residuals vs. Eliminations")

Here, we plot the residuals versus the number of eliminations per game. The plot looks symmetrical about \(y = 0\), which is an indication that there is not a significant deviation from the trend line.

5.7 Plotting Residuals Vs. Fitted Values

fitted = fitted(fn.lm)
trendscatter(residuals ~ fitted, 
     xlab = "Fitted Values",
     ylab = "Residuals",
     ylim = c(-1.5 * max(residuals), 1.5 * max(residuals)),
     main = "Residuals vs. Fitted Values")

Again, the plot is roughly symmetrical about 0, further supporting the adequacy of our model.

5.8 Normality Check

normcheck(fn.lm, shapiro.wilk = TRUE)

The null hypothesis for the Shapiro-Wilk normality test is that the errors are distributed normally.

\[H_0: \epsilon \sim N(0, \sigma^2)\] Since the p-value = 0.093 is greater than the alpha level of 0.05, we fail to reject the null hypothesis and conclude that the errors are plausibly normally distributed, as we assumed.
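
normcheck() is an s20x convenience wrapper; the underlying test is base R's shapiro.test() applied to the model residuals. A sketch on simulated data (the p-value here is illustrative, not the 0.093 reported above):

```r
# Shapiro-Wilk test on residuals: small p-values cast doubt on normality.
set.seed(1)
x <- rpois(50, 3)                        # simulated eliminations
y <- 125 + 185 * x + rnorm(50, 0, 190)   # errors drawn from a normal here
fit <- lm(y ~ x)

sw <- shapiro.test(residuals(fit))
sw$p.value   # fail to reject normality when this exceeds 0.05
```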

6 Testing Another Model for Comparison

Now, we are going to see whether a quadratic model fits our data better than the linear model. The quadratic model is the following: \[y_i = \beta_0 + \beta_1x_i + \beta_2x_i^2 + \epsilon_i,\] where \(\beta_0\), \(\beta_1\), and \(\beta_2\) are unknown constant parameters and \(\epsilon_i\) is the error term.

6.1 Summary of the Quadratic Model

quad.lm = lm(DamageDealt ~ Eliminations + I(Eliminations^2), data = fn)

summary(quad.lm)
## 
## Call:
## lm(formula = DamageDealt ~ Eliminations + I(Eliminations^2), 
##     data = fn)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -555.65  -76.71    1.15   88.54  429.41 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        104.119     51.983   2.003    0.051 .  
## Eliminations       198.969     23.320   8.532 4.14e-11 ***
## I(Eliminations^2)   -1.333      1.929  -0.691    0.493    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 192.7 on 47 degrees of freedom
## Multiple R-squared:  0.8996, Adjusted R-squared:  0.8954 
## F-statistic: 210.7 on 2 and 47 DF,  p-value: < 2.2e-16

From this, we can see that \(\hat{\beta_0} = 104.119\), \(\hat{\beta_1} = 198.969\), and \(\hat{\beta_2} = -1.333\), giving the fitted quadratic model: \[\hat{y} = 104.119 + (198.969)x + (-1.333)x^2\]

6.2 Confidence Intervals for the Parameter Estimates

Here are the 95% confidence intervals for the parameter estimates:

ciReg(quad.lm)
##                   95 % C.I.lower    95 % C.I.upper
## (Intercept)             -0.45663         208.69426
## Eliminations           152.05476         245.88409
## I(Eliminations^2)       -5.21389           2.54885

6.3 Fitting a Quadratic Curve to the Points

plot(DamageDealt ~ Eliminations, 
     bg = "Blue", 
     pch = 21, 
     cex = 1.2,
     ylim = c(0, 1.1 * max(DamageDealt)),
     xlim = c(0, 1.1 * max(Eliminations)),
     main = "Quadratic Curve of Damage Dealt vs. Eliminations",
     data = fn)

curve(quad.lm$coef[1] + quad.lm$coef[2]*x + quad.lm$coef[3]*x^2, add = TRUE)

Here, we see the scatter plot of damage dealt versus eliminations with the quadratic curve overlaid on top. Visually, it still looks fairly linear, but we will conduct further tests before drawing a conclusion.

6.4 Plotting Residuals vs. Fitted Values

plot(quad.lm, which = 1)

Again, this plot mirrors the result we got from the linear model in that there is symmetry about \(y=0\).

6.5 Normality Check

normcheck(quad.lm, shapiro.wilk = TRUE)

Again, the null hypothesis of the Shapiro-Wilk normality test is that \(\epsilon \sim N(0, \sigma^2)\). Unlike the linear model though, the quadratic model’s p-value is 0.036, which is less than the alpha level of 0.05. Thus, we must reject the null hypothesis and conclude that the errors are likely not distributed normally at an alpha level of 0.05. Furthermore, for a polynomial regression model to be valid, we must assume that errors are distributed normally with a mean of 0 and a constant variance (Waissi). This means that the quadratic model is likely an invalid one to use for this data set.

6.6 Anova Test

To further reinforce that the quadratic model should not be preferred over the linear one, we will conduct an analysis of variance (ANOVA) test.

anova(fn.lm, quad.lm)

Here, we see a large p-value (with only one added term, the ANOVA F-test is equivalent to the t-test on the quadratic term, so the p-value is 0.493). Thus, we conclude that the quadratic term does not significantly improve the fit and the linear model is the better choice.
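
The equivalence used above (for a single added term, \(F = t^2\), so the ANOVA and t-test p-values agree) can be checked directly on simulated data:

```r
# Compare nested models: linear vs. linear + quadratic term.
set.seed(1)
x <- rpois(50, 3)                        # simulated eliminations
y <- 125 + 185 * x + rnorm(50, 0, 190)   # simulated damage dealt
lin  <- lm(y ~ x)
quad <- lm(y ~ x + I(x^2))

p_anova <- anova(lin, quad)[2, "Pr(>F)"]
p_t     <- coef(summary(quad))["I(x^2)", "Pr(>|t|)"]
all.equal(p_anova, p_t)   # TRUE: one added term, so F = t^2
```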

7 Avoiding Bias

To avoid bias in our model, we must examine influential points using a Cook’s distance plot. According to Wikipedia, data points with large residuals or high leverage may distort the outcome and accuracy of a regression. Cook’s distance measures the effect of deleting a given observation, and points with a large Cook’s distance merit closer examination in the analysis.

cooks20x(fn.lm)

This means that matches #7, #46, and #48 need to be examined because they have outstanding Cook’s distances.
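
cooks20x() is from s20x; base R's cooks.distance() returns the same values, and one common screening rule flags observations above 4/n. A sketch on simulated data (the flagged indices here are illustrative, not matches #7, #46, and #48):

```r
# Compute Cook's distances and flag candidate influential observations.
set.seed(1)
x <- rpois(50, 3)                        # simulated eliminations
y <- 125 + 185 * x + rnorm(50, 0, 190)   # simulated damage dealt
fit <- lm(y ~ x)

cd <- cooks.distance(fit)
which(cd > 4 / length(cd))   # indices exceeding the 4/n rule of thumb
```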

To make a better model, I can remove the match with the highest Cook’s distance (match #46, where I got 15 eliminations):

fn.lm2 = lm(DamageDealt ~ Eliminations, data = fn[-46,])
summary(fn.lm2)
## 
## Call:
## lm(formula = DamageDealt ~ Eliminations, data = fn[-46, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -555.59  -67.42   -6.05  101.13  399.50 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    107.42      42.66   2.518   0.0153 *  
## Eliminations   191.64      10.60  18.081   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 190.3 on 47 degrees of freedom
## Multiple R-squared:  0.8743, Adjusted R-squared:  0.8716 
## F-statistic: 326.9 on 1 and 47 DF,  p-value: < 2.2e-16

For comparison, here is the summary of the original linear model:

summary(fn.lm)
## 
## Call:
## lm(formula = DamageDealt ~ Eliminations)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -537.13  -81.85    5.18   91.70  440.56 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   126.615     40.291   3.142  0.00287 ** 
## Eliminations  184.103      8.925  20.628  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 191.7 on 48 degrees of freedom
## Multiple R-squared:  0.8986, Adjusted R-squared:  0.8965 
## F-statistic: 425.5 on 1 and 48 DF,  p-value: < 2.2e-16

As we can see, the residual standard error is only slightly lower for the modified model, but its \(R^2\) value is lower as well, so removing the match does not meaningfully improve the fit.

8 Conclusion

8.1 Results to the Research Question

A linear model is the best fit for this data; that is, the damage dealt to enemies per game increases linearly with the number of eliminations per game. I verified that all assumptions made for the linear model hold, and I showed that a quadratic model was invalid because its errors were likely not normally distributed. Thus, the final model for our data is:

\[\hat{y} = 126.615 + (184.103)x_i\]

8.2 Suggestions for Further Experimentation

A larger sample size would produce a more accurate model. Recording fifty matches takes a long time (even when playing video games), and I did not want to spend more time recording, as I thought this was a healthy balance between time and accuracy. Still, given more time, I would definitely have recorded more matches.

Given the nature of the game’s mechanics, it is difficult to get observations on the higher end of eliminations (7+). For example, I had one game with 15 eliminations, which was an outlier in my sample. On the other hand, it was extremely easy to get observations on the lower end (< 3 eliminations). As a result, the model becomes less and less accurate as the number of eliminations increases. To remedy this, again, more games would need to be played to get an ample number of observations with high elimination counts.

9 References

“Cook’s distance.” Wikipedia. www.en.wikipedia.org/wiki/Cook%27s_distance. Accessed 26 Apr. 2019.

Mendenhall, William M., and Terry L. Sincich. Statistics for Engineering and the Sciences, 6th edition. CRC Press, 2016.

Waissi, Gary R. “Polynomial Regression.” Arizona State University. www.public.asu.edu/~gwaissi/ASM-e-book/module402.html. Accessed 26 Apr. 2019.